CombiTagger: A System for Developing Combined Taggers
Abstract
The main task of part-of-speech (PoS) tagging is to assign the appropriate morphosyntactic category to each word in a sentence. Combining different PoS taggers usually yields higher tagging accuracy than using a single tagger. We present a new language- and tagset-independent system, CombiTagger, which automatically combines the output of several taggers. The system, which is open source, provides algorithms for simple and weighted voting, and it is extensible so that other combination algorithms can be added easily. We demonstrate the functionality of CombiTagger by using it to develop and evaluate combined taggers for Icelandic. The most accurate individual tagger obtains an accuracy of 91.83%. CombiTagger achieves 93.09%-93.41% accuracy by combining the output of five or six taggers using simple and ...
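The two voting schemes named in the abstract can be illustrated with a minimal Python sketch. This is not CombiTagger's own code or API: the function names (`simple_vote`, `weighted_vote`, `combine`), the tie-breaking rule (the earliest tagger wins), the toy tags, and the weights are all assumptions made for illustration. In practice a weight might be, for example, each tagger's accuracy on held-out data.

```python
from collections import defaultdict


def simple_vote(tags):
    """Return the tag proposed by the most taggers.

    Ties are broken in favour of the earliest tagger (an assumption).
    """
    counts = defaultdict(int)
    for tag in tags:
        counts[tag] += 1
    best = max(counts.values())
    # Scan in tagger order so tie-breaking is deterministic.
    return next(tag for tag in tags if counts[tag] == best)


def weighted_vote(tags, weights):
    """Return the tag with the largest summed tagger weight."""
    scores = defaultdict(float)
    for tag, weight in zip(tags, weights):
        scores[tag] += weight
    best = max(scores.values())
    return next(tag for tag in tags if scores[tag] == best)


def combine(tagger_outputs, weights=None):
    """Combine token-aligned tag sequences, one sequence per tagger.

    Uses simple voting when no weights are given, weighted voting otherwise.
    """
    combined = []
    for proposals in zip(*tagger_outputs):
        if weights is None:
            combined.append(simple_vote(proposals))
        else:
            combined.append(weighted_vote(proposals, weights))
    return combined


# Hypothetical output of three taggers for a three-token sentence.
outputs = [
    ["N", "V", "ADJ"],  # tagger A
    ["N", "N", "ADJ"],  # tagger B
    ["V", "N", "N"],    # tagger C
]

simple_result = combine(outputs)
weighted_result = combine(outputs, weights=[0.9, 0.3, 0.3])
```

With these toy inputs, simple voting follows the majority on the second token, while weighted voting lets the heavily weighted tagger A override the two taggers that disagree with it.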
Similar resources
Something Borrowed, Something Blue: Rule-based Combination of POS Taggers
Linguistically annotated text resources are still scarce for many languages and for many text types, mainly because their creation represents a major investment of work and time. For this reason, it is worthwhile to investigate ways of reusing existing resources in novel ways. In this paper, we investigate how off-the-shelf part of speech (POS) taggers can be combined to better cope with text m...
Hybrid Techniques for Training HMM Part-of-Speech Taggers
We describe and experimentally evaluate a hybrid technique for training part-of-speech taggers which utilises training from small quantities of unambiguously-tagged material combined with maximum likelihood re-estimation over the target untagged corpus. This approach, unlike previous ones employing re-estimation, does not involve skilled manipulation of the initial parameters of the model or th...
Building Domain-Specific Taggers without Annotated (Domain) Data
Part of speech tagging is a fundamental component in many NLP systems. When taggers developed in one domain are used in another domain, the performance can degrade considerably. We present a method for developing taggers for new domains without requiring POS annotated text in the new domain. Our method involves using raw domain text and identifying related words to form a domain specific lexico...
Tagging the Past: Experiments using the Saga Corpus
There is an increasing interest in the NLP community in developing tools for annotating historical data, for example, to facilitate research in the field of corpus linguistics. In this work, we experiment with several PoS taggers using a sub-corpus of the Icelandic Saga Corpus. This is carried out in three main steps. First, we evaluate taggers, which were trained on Modern Icelandic, when tagg...
Writing Annotation Instructions
In two corpus annotation projects, we followed similar strategies for developing annotation instructions and obtained good inter-coder reliability results for both (the instructions are similar in style to Allen & Core 1996). Our goal in developing the annotation instructions was that they can be used reliably, after a reasonable amount of training, by taggers who are non-experts but who have g...